
Introduction

The white wine dataset explored in this report was created using variants of the Portuguese “Vinho Verde” wine. The dataset consists of 4898 observations (wine samples) and 13 variables, including the ordinal output variable, quality. The input variables come from objective tests (e.g. pH values), and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). For more details, consult http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].

Univariate Plots Section

To help guide this exploration, we’ll first output the structure of the dataset and its summary statistics.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The results show our dataset consists of 4898 observations and 13 variables, one of which (X) is simply a row index.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The summary statistics (minimum and maximum values, quartiles, median, and mean) provide more detail for each variable and will help guide this exploration as we investigate single variables as well as their relationships. They will also be compared against the univariate plots below.
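The code behind the two outputs above is not shown in this report. As a minimal sketch of how such output can be produced, the following uses a hypothetical data frame, `wine_sample`, built from the first ten rows printed in the structure output rather than from the full dataset.

```r
# Hypothetical sample: the first ten rows of three columns, copied from the
# structure output above. The full analysis would use all 4898 rows.
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality        = rep(6L, 10)
)

str(wine_sample)       # dimensions, column types, first values
summary(wine_sample)   # min, quartiles, median, mean, max per column

# The IQR is the distance between the third and first quartiles.
sugar_quartiles <- quantile(wine_sample$residual.sugar, c(0.25, 0.75))
sugar_iqr <- IQR(wine_sample$residual.sugar)
sugar_iqr
```

On the full dataset the same calls would be applied to the complete data frame; the sample here only keeps the sketch self-contained.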

Quality Plot

Looking at the distribution of quality, a rating of 6 is the most frequent, followed by 5 and 7 respectively. So which factors determine wine quality? Are there specific traits that make for a better quality white wine?

Residual Sugar Plots

## Warning: Removed 7 rows containing non-finite values (stat_bin).

The distribution of residual sugar appears to be skewed right, with most wines containing less than 10 grams per liter. How does residual sugar content vary among quality ratings? Can we expect a range or a specific sweet spot for higher quality wines?

Sulphates Plot

From the plot we can see that the IQR is relatively narrow, spanning 0.41-0.55 g/dm3. It doesn’t seem likely that we will see much variation in sulphates across the wine samples, but we will continue our exploration of trends and observations.

pH Plot

The histogram for pH shows a fairly symmetric distribution with a mean of 3.188: acidic, but not quite as acidic as vinegar.

Alcohol Plot

The distribution for alcohol appears skewed right, with the middle 50% of wines in the range of 9.5-11.4 percent alcohol content. Is this range a factor that is reflected in quality ratings?

Density Plot

## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Most wines have a density of about 0.9921, with a mean of 0.9940 and a median of 0.9937. Considering the amounts of alcohol and sugar in a particular wine, the density of wine should be similar to the density of water. A log10 transformation of the long-tailed data was performed to better understand the distribution of density.

Citric Acid Plot

## Warning: Removed 16 rows containing non-finite values (stat_bin).

Citric acid has an IQR in the range of 0.27-0.39. Considering that citric acid can add “freshness” and flavor to wines, would it be correct to assume that this is added to lower-quality wines to improve the rating?

Fixed Acidity Plot

## Warning: Removed 4 rows containing non-finite values (stat_bin).

Looking at the distribution of fixed acidity, the IQR falls within the range 6.3-7.3.

Volatile Acidity Plot

## Warning: Removed 2 rows containing missing values (geom_bar).

A log10 transformation of the long-tailed data was performed to better understand the distribution of volatile acidity. The transformed distribution appears relatively normal and symmetric, peaking at about 0.27 g/dm^3.

Chlorides Plot

## Warning: Removed 2 rows containing missing values (geom_bar).

A log10 transformation of the long-tailed data was performed to better understand the distribution of chlorides. The transformed distribution appears relatively normal and symmetric, peaking at around 0.046 g/dm^3.

Total Sulfur Dioxide Plot

Sulfur dioxide is used as a preservative in wine, and the histogram for total sulfur dioxide shows a relatively normal distribution with a mean of 138.4 mg/dm^3. At concentrations above 50 ppm sulfur dioxide becomes detectable in the wine; perhaps this will show up when we compare with quality ratings.

Free Sulfur Dioxide Plot

## Warning: Removed 41 rows containing non-finite values (stat_bin).

The histogram for free sulfur dioxide appears to be skewed right. The mean value for free sulfur dioxide is 35.31 mg/dm^3, and the third quartile is 46 mg/dm^3.

Univariate Analysis

What is the structure of your dataset?

There are 4,898 white wines sampled in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality).

Other observations: Most white wines have a quality rating of 6. The median alcohol content is 10.40 percent, and the mean is 10.51 percent. Most wines have a density of about 0.9921, with a mean of 0.9940 and a median of 0.9937. About 75% of wines have residual sugar below 10 g/dm^3. Citric acid has an IQR in the range of 0.27-0.39 g/dm^3.

What is/are the main feature(s) of interest in your dataset?

The main point of interest in the dataset is understanding how the individual variables relate to the discrete quality variable. For this part of the analysis, histograms seemed the best way to provide the detail needed to start understanding the data, so we created a plot for each variable (excluding variable X). From the histograms we can begin to explore which balance of features best determines the quality of a wine. I suspect there will be some common profiles for each quality rating, although a wine’s quality rating also depends on the personal preferences of the wine steward.

Balance in taste is a key component in considering the quality of wine and individual preference in this matter is subjective. In our dataset balance will consider residual sugar, chlorides, volatile acidity, citric acid, and alcohol. I am interested in exploring whether there is any relationship among these individual variables and combinations of variables to better discern trends among the ordinal quality variable. If able to distinguish trends, it is possible that we could develop a predictive model for white wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Because free sulfur dioxide is used to prevent microbial growth and oxidation, and sulphates act as an antimicrobial and antioxidant in wine, perhaps we will be able to glean some interesting insights by comparing these variables against the quality variable later in the multivariate section.

Did you create any new variables from existing variables in the dataset?

No new variables were created from existing variables in the dataset.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

To get a clearer view of the distributions of density, volatile acidity, and chlorides, a log10 scale was used to compress the long right tails. Further down, in the bivariate analysis section, the cut function is used to convert the quality ratings into a discrete factor, which allows a better comparison with other variables in the bivariate and multivariate analysis sections.
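To illustrate what a log10 transformation does to long-tailed data, here is a small sketch in base R. The values are hypothetical, chosen to resemble the chlorides column (its minimum, quartiles, and maximum come from the summary table; the rest are made up), and the report’s own plots appear to have been drawn with ggplot2 rather than `hist()`.

```r
# Hypothetical long-tailed values in the spirit of the chlorides column.
chlorides <- c(0.009, 0.036, 0.043, 0.045, 0.050, 0.058, 0.120, 0.346)

# log10 compresses large values more than small ones, which pulls in the tail.
log_chlorides <- log10(chlorides)

range(chlorides)       # raw scale: the maximum sits far from the bulk
range(log_chlorides)   # log scale: the values are spread more evenly

# Base-R histograms of both versions for a side-by-side comparison.
hist(chlorides, main = "Raw chlorides")
hist(log_chlorides, main = "log10(chlorides)")
```

Note that the transformation does not remove any observations; it only changes how far apart they are drawn.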

Bivariate Plots Section

We will open the bivariate analysis section by cutting quality into a new discrete variable whose levels correspond to the existing rating scores. Then we will take a high-level view of correlation among pairs of variables, using a subset of the dataset with the variable “X” removed.

Quality Discrete Variables

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
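The call used to discretize quality is not shown. The bin labels such as `(2,3]` that appear in the by-quality summaries later in this report are consistent with `cut()` using integer breaks, so the following is a sketch under that assumption, with a hypothetical handful of ratings.

```r
# Hypothetical handful of ratings; the real column holds 4898 values.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Breaks at 2, 3, ..., 9 give right-closed bins (2,3], (3,4], ..., (8,9],
# so each integer rating falls in its own bin.
quality_bucket <- cut(quality, breaks = 2:9)

levels(quality_bucket)
table(quality_bucket)

# Optional: label each bin with the rating itself instead of the interval.
quality_labeled <- cut(quality, breaks = 2:9, labels = 3:9)
table(quality_labeled)
```

Labeling the bins with the rating (the second call) would make later output easier to read than the interval notation.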

Scatterplot Matrix

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Reviewing the correlations between the variables in the dataset, we can identify notable correlations among certain pairs, both positive (agreement) and negative (disagreement). Let’s plot these variable combinations and perform a correlation test for each, focusing on the stronger correlation coefficients. They include the following.

Correlation Coefficient Agreement: total sulfur dioxide vs. free sulfur dioxide; fixed acidity vs. density; residual sugar vs. total sulfur dioxide; chlorides vs. density; quality vs. alcohol; density vs. residual sugar; density vs. free sulfur dioxide; density vs. total sulfur dioxide.

Correlation Coefficient Disagreement: residual sugar vs. alcohol; alcohol vs. density.
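As a rough sketch of the subsetting step described earlier (dropping “X” before looking at correlations), the following builds a hypothetical `wine_sample` from the first ten rows of the structure output and computes a Pearson correlation matrix with base R. The names `wine_sample`, `wine_vars`, and `cor_matrix` are illustrative and not taken from the original code.

```r
# Hypothetical sample built from the first ten rows of the structure output.
wine_sample <- data.frame(
  X                    = 1:10,
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Drop the row index so it does not appear in the correlation matrix.
wine_vars <- subset(wine_sample, select = -X)

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(wine_vars), 3)
cor_matrix
```

With only ten rows the coefficients will differ from those of the full dataset; the point is the shape of the workflow, not the numbers.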

Correlation Coefficient Agreement

Total and Free Sulfur Dioxide

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

From the above results we see a correlation coefficient for the free sulfur dioxide, total sulfur dioxide pair as 0.615501, and a 95 percent confidence interval of 0.5977994 - 0.6326026.
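The output format above matches what R’s `cor.test()` prints. As a minimal sketch, the following runs the test on a hypothetical ten-row sample (the first ten values from the structure output) and then checks how the reported t statistic relates to the reported coefficient and degrees of freedom.

```r
# Hypothetical sample: first ten rows from the structure output.
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Pearson test on the sample; the result holds the estimate and its interval.
sample_test <- cor.test(free_so2, total_so2, method = "pearson")
sample_test$estimate
sample_test$conf.int

# The t statistic follows from r and the degrees of freedom:
#   t = r * sqrt(df) / sqrt(1 - r^2)
# Plugging in the values reported above for the full dataset:
r  <- 0.615501
df <- 4896
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat   # close to the reported t = 54.645
```

Because the sample size is so large, even modest coefficients yield very large t statistics and tiny p-values, so the size of the coefficient is more informative here than the p-value.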

Fixed Acidity and Density

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and density
## t = 19.256, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2391013 0.2911738
## sample estimates:
##      cor 
## 0.265331

From these results we see a weaker correlation coefficient for the fixed acidity and density pair, at 0.265331, with a 95 percent confidence interval of 0.2391013 - 0.2911738.

Residual Sugar and Total Sulfur Dioxide

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and total.sulfur.dioxide
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3776791 0.4246712
## sample estimates:
##       cor 
## 0.4014393

From these results we see a moderate correlation coefficient for the variable pair, at 0.4014393, with a 95 percent confidence interval of 0.3776791 - 0.4246712. I wonder how total sulfur dioxide affects wine quality?

Chlorides and Density

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 62 rows containing non-finite values (stat_smooth).
## Warning: Removed 62 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  chlorides and density
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2308679 0.2831779
## sample estimates:
##       cor 
## 0.2572113

The plot above shows a weak positive correlation between density and chlorides, with a correlation coefficient of 0.2572113. Chlorides may be more closely associated with wine taste than with any large effect on the density of the wine.

Residual Sugar and Density

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

Residual sugar appears to have more of an impact on density. This variable pair shows a strong correlation coefficient of 0.8389665, and a 95 percent confidence interval of 0.8304732 - 0.8470698.

Total Sulfur Dioxide and Density

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5094349 0.5497297
## sample estimates:
##       cor 
## 0.5298813

Total sulfur dioxide and density show a moderate positive correlation, at 0.5298813.

Free Sulfur Dioxide and Density

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and density
## t = 21.54, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2684156 0.3195836
## sample estimates:
##       cor 
## 0.2942104

The variable pair of free sulfur dioxide and density shows a weaker correlation, at 0.2942104.

Correlation Coefficient Disagreement

Residual Sugar and Alcohol

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

For the negative correlations, we looked at the two variable pairs with the strongest disagreement. The pair of residual sugar and alcohol has a correlation of -0.4506312, which indicates that as alcohol content increases, residual sugar tends to decrease. I wonder how this factor plays out when compared with quality ratings?

Alcohol and Density

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_smooth).

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

This variable pair, alcohol and density, has a correlation coefficient of -0.7801376, indicating that density decreases as alcohol content increases. To help visualize this relationship better we created a geom_line plot in addition to the geom_point plot.

Quality Box Plots

Next let’s survey quality box plots for the variables associated with wine balance.

Quality and Alcohol

## w_w$quality: (2,3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## w_w$quality: (3,4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## w_w$quality: (4,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## w_w$quality: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## w_w$quality: (6,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## w_w$quality: (7,8]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## w_w$quality: (8,9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

As we can see from the results, there is a generally positive trend: from a rating of 5 upward, mean alcohol content rises with quality, from 9.809 at a rating of 5 to 12.18 at a rating of 9.
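The grouped summaries above have the layout that `by()` produces in R. The sketch below shows one way to generate that kind of output; it reuses the data frame name `w_w` seen in the output headers, but the values are hypothetical and the original call is an assumption.

```r
# Hypothetical sample; `w_w` mirrors the data frame name seen in the output.
w_w <- data.frame(
  alcohol = c(9.5, 10.1, 9.2, 9.8, 10.5, 11.0, 11.4, 12.0, 12.5, 12.7),
  quality = c(4, 4, 5, 5, 6, 6, 7, 7, 8, 8)
)
w_w$quality <- cut(w_w$quality, breaks = 3:8)

# One summary() per quality bucket, printed in the same layout as above.
by(w_w$alcohol, w_w$quality, summary)

# tapply() gives just the group means, which is handy for spotting a trend.
mean_alcohol <- tapply(w_w$alcohol, w_w$quality, mean)
mean_alcohol
```

Comparing group means or medians in this way is a quick numeric check on what the box plots show visually.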

Quality and Citric Acid

## w_w$quality: (2,3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2100  0.2575  0.3450  0.3360  0.3850  0.4700 
## -------------------------------------------------------- 
## w_w$quality: (3,4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1900  0.2900  0.3042  0.4000  0.8800 
## -------------------------------------------------------- 
## w_w$quality: (4,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2400  0.3200  0.3377  0.4100  1.0000 
## -------------------------------------------------------- 
## w_w$quality: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.320   0.338   0.380   1.660 
## -------------------------------------------------------- 
## w_w$quality: (6,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.2800  0.3100  0.3256  0.3600  0.7400 
## -------------------------------------------------------- 
## w_w$quality: (7,8]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.2800  0.3200  0.3265  0.3600  0.7400 
## -------------------------------------------------------- 
## w_w$quality: (8,9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   0.340   0.360   0.386   0.450   0.490

This box plot compares quality with citric acid. The results indicate that citric acid content remains relatively level across the wine quality ratings. There appear to be more outliers among wines with a quality rating of 6.

Quality and Volatile Acidity

## w_w$quality: (2,3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## w_w$quality: (3,4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## w_w$quality: (4,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## w_w$quality: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## w_w$quality: (6,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## w_w$quality: (7,8]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## w_w$quality: (8,9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

As the results indicate, mean volatile acidity is highest among wines with a quality rating of 4 (0.3812).

Quality and Chlorides

## w_w$quality: (2,3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## w_w$quality: (3,4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## w_w$quality: (4,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## w_w$quality: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## w_w$quality: (6,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## w_w$quality: (7,8]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## w_w$quality: (8,9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

In this box plot for quality and chlorides we see a generally flat trend, with fairly consistent chloride levels across quality ratings. There is, however, a slight dip in chlorides at the 7, 8, and 9 quality ratings. The mean chloride level for a quality rating of 9 is 0.0274.

Quality and Residual Sugar

## w_w$quality: (2,3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## -------------------------------------------------------- 
## w_w$quality: (3,4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## w_w$quality: (4,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## w_w$quality: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## w_w$quality: (6,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## w_w$quality: (7,8]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## w_w$quality: (8,9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

The results for quality and residual sugar appear varied. An interesting point worth noting is the mean residual sugar content for quality ratings 4 (4.628) and 9 (4.12): both ratings have similar levels of residual sugar. I’m not quite sure why this would be the case; perhaps this variable is affected more by personal preference than any other variable?

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In this survey we explored the quality rating (dependent variable) against the five variables affecting wine balance (independent variables), which yielded some interesting insights. Scatter plots with correlation coefficients allowed us to look for patterns among the independent variables, while the box plots showed how those variables stack up against the dependent quality variable. This will help us later when we attempt to explore profiles of multiple variables.

One relationship worth noting is alcohol and quality. As alcohol increases, quality ratings appear to increase generally. Another interesting observation is that the means for citric acid and volatile acidity appear to remain flat as quality ratings improve. Also, chloride means tend to decrease slightly as quality ratings increase. Means for residual sugar fluctuate among quality ratings.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Some stronger correlation coefficients exist between density and residual sugar (0.839), total sulfur dioxide and free sulfur dioxide (0.616), and alcohol and density (-0.780). These stood out to me and will be used to further understand our data. I was expecting that free sulfur dioxide content would vary depending on the other flavor variables, but it appears that the amount of free sulfur dioxide may be more related to wine volume.

What was the strongest relationship you found?

The strongest correlation, as mentioned above, exists between density and residual sugar, which suggests that as residual sugar content increases, so does density. We will continue in the multivariate plots section with our exploration of how balance is expressed among quality ratings. These efforts will help fill in additional understanding of distinct relationships.

Multivariate Plots Section

Alcohol and Residual Sugar - Quality

From the multivariate plot above, when we look at alcohol content against residual sugar we see that higher quality ratings generally go with higher alcohol and lower residual sugar content. Conversely, lower quality ratings show a general profile of more sugar and less alcohol.

Considering this combination of variables, we can say that white wines with higher residual sugar tend to have lower quality ratings, while white wines with high alcohol content tend to have higher quality ratings.
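The plotting code for this section is not shown. As a sketch of the general technique, a scatter plot of two variables colored by a third, discrete one, the following uses ggplot2 with hypothetical sample values. The palette, transparency, and axis labels are illustrative choices, not the report’s actual settings.

```r
library(ggplot2)

# Hypothetical sample values; column names follow the dataset.
wine_sample <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 11.0, 11.4, 12.0, 12.5, 12.7, 9.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 1.5, 3.7, 4.3, 2.2, 2.0, 11.5),
  quality        = c(5, 5, 6, 6, 6, 7, 8, 9, 9, 4)
)

# Treat quality as a discrete factor so each rating gets its own colour.
wine_sample$quality <- factor(wine_sample$quality)

p <- ggplot(wine_sample,
            aes(x = alcohol, y = residual.sugar, colour = quality)) +
  geom_point(alpha = 0.7) +
  scale_colour_brewer(type = "seq") +
  labs(x = "Alcohol (% by volume)", y = "Residual sugar (g/dm^3)",
       colour = "Quality")
p
```

A sequential palette is used so that darker points correspond to higher ratings, which makes a gradient across the plot easier to spot than arbitrary colors would.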

Residual Sugar and Density - Quality

From the multivariate plot above, when we look at residual sugar content against density we’ll see that quality ratings generally trend with a lower residual sugar content and lower density. Considering this combination of variables, we can say that white wines with higher residual sugar content tend to have lower quality ratings, while white wines with a higher density also tend to have lower quality ratings.

Volatile Acidity and Alcohol - Quality

Along with the trend between alcohol content and quality rating identified earlier in the bivariate section (as alcohol content increases, quality rating also tends to increase), we can see from the results above that volatile acidity doesn’t have a strong correlation with quality in our dataset. This leads me to think that volatile acidity is controlled across all quality ratings. I wonder whether wines outside this controlled range of volatile acidity would be used for vinegar?

Citric Acid and Residual Sugar - Quality

## `geom_smooth()` using method = 'gam'

Based on the multivariate plot above, when we look at residual sugar content against citric acid we can see that higher quality ratings generally fall within a narrow band of citric acid content while being dispersed across a range of residual sugar content. For lower quality ratings, citric acid is dispersed over a wider range. From these results we can say that white wines with a higher quality rating have a tighter range of citric acid content than wines with a lower quality rating.

Chlorides and Alcohol - Quality

A trend can be read from the above results: outside of a few outliers, higher quality wines tend to have a lower chloride content, while lower quality wines tend to have a higher chloride content.

Free Sulfur Dioxide and Sulphates - Quality

In an effort to examine the preservation of wine flavor and oxidation control, we can assert from the above results that sulphate content is dispersed across all quality ratings. Higher quality wines tend to have higher free sulfur dioxide content. However, the distribution of free sulfur dioxide shows considerable variation.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivariate correlation plots used in this section help us understand in more detail the trends (or lack of trends) among multiple combinations of variables. Specifically, by coloring the plots by quality, we were able to distinguish the impact multiple variables have on wine quality.

From the resulting plots in this section we can understand that higher quality wines generally feature higher alcohol and free sulfur dioxide content, and lower chloride and residual sugar content. Citric acid for higher quality white wines exists in a relatively strict concentration range. Residual sugar has a strong correlation with density: as residual sugar increases, so does density, and higher quality wines tend to be relatively low in both. When we compare residual sugar with quality ratings, however, there is some variation, but the overall trend is lower residual sugar content for higher quality ratings. These results give us a fairly decent profile of how the variables we explored relate to quality ratings.

Were there any interesting or surprising interactions between features?

I wasn’t expecting the results attained from the free sulfur dioxide, sulphates, and quality multivariate analysis. I found it intriguing yet sensible that only a certain amount of free sulfur dioxide was needed to control wine flavor degradation. I had assumed that there might be some correlation between quality and the need for additional free sulfur dioxide, but it appears that the amount of free sulfur dioxide required to maintain flavor is fairly consistent.


Final Plots and Summary

Plot One

## w_w$quality: (2,3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## w_w$quality: (3,4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## w_w$quality: (4,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## w_w$quality: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## w_w$quality: (6,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## w_w$quality: (7,8]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## w_w$quality: (8,9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

Description One

The box plot above shows how alcohol content varies across quality ratings. This provides a jumping-off point for exploring balance in white wine taste.
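The summary table printed under Plot One has the shape of what `by()` returns when alcohol is summarized within binned quality ratings. The sketch below is a guess at how such a table could be reproduced, not the original code: `alcohol_summary_by_quality` is a hypothetical helper, and the `2:9` breaks are inferred from the interval labels in the output.

```r
# Summary statistics of alcohol within each quality bin.
# cut() with breaks 2:9 yields the bins (2,3], (3,4], ..., (8,9].
# (alcohol_summary_by_quality is a hypothetical helper; the breaks are
# inferred from the printed interval labels, not taken from the source code.)
alcohol_summary_by_quality <- function(wines) {
  quality_bin <- cut(wines$quality, breaks = 2:9)
  by(wines$alcohol, quality_bin, summary)
}

# Only runs when the CSV is present in the working directory.
if (file.exists("wineQualityWhites.csv")) {
  w_w <- read.csv("wineQualityWhites.csv")
  print(alcohol_summary_by_quality(w_w))
}
```

The group headers will read `quality_bin:` rather than `w_w$quality:` because the binned factor is held in a local variable here.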

Plot Two

Description Two

Citric acid and residual sugar seem like they would have the most impact on perceived flavor for most people. This multivariate plot explores the variable pair, colored by quality rating.

Plot Three

Description Three

Alcohol and chlorides also have an effect on the balance of a wine. This multivariate analysis explores both variables, colored by quality rating, to further explore flavor profiles.

Reflection

Feeling out and uncovering insights in this dataset via exploratory data analysis is a test of patience, determination and objectivity. In the effort to weigh independent variables against the dependent variable for this white wine dataset, we were able to uncover trends among variables related to balance and flavor to better understand how wine sommeliers rate wines.

One area where I ran into difficulties was the correlation matrix plot. I wasn’t quite successful in ensuring that all the components of the matrix were sized correctly and that the labels were easy to read. Initially I also wanted to practice applying plot colors in the univariate section, but I ultimately found the colors distracting and settled on a simpler color palette.

Without knowing a great deal of detail regarding the wine quality ratings process, I found it difficult to understand the ratings. Through this analysis we were able to explore the variables related to wine balance (residual sugar, chlorides, volatile acidity, citric acid, and alcohol). In weighing several variables against each other in the multivariate plots, we observed that higher quality wines generally feature higher alcohol and free sulfur dioxide content, and lower chloride and residual sugar content. Citric acid for higher quality white wines exists in a relatively strict concentration range, while residual sugar has a strong correlation with density, and higher quality wines tend to be relatively low in both. When residual sugar is compared with quality ratings, there is some variation among results, but the overall trend is lower residual sugar content for higher quality ratings.

Considering the insights gleaned from this exercise, it would be possible to further enrich this analysis by using additional data, for example by comparing related data for red wine with this data for white wine. This could be used to identify similarities and differences between the two types of wine, while also helping us understand the balance of flavor for red wine samples. Future efforts to improve this analysis should include the development of a wine quality model to aid in the prediction of wine quality, based on the flavor variables we explored in this analysis. We could use this additional analysis to buy and sample our own collection of wines.
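As a starting point for such a model, the flavor variables named in this reflection could be fed into a simple linear regression. This is only a sketch under stated assumptions: `fit_quality_model` is a hypothetical helper, the predictor list is my selection from the variables discussed above, and treating the ordinal quality rating as a numeric response is a simplification (an ordinal model such as `MASS::polr` may be a better fit for this kind of outcome).

```r
# Linear model of quality on the flavor-related variables explored above.
# Treats the ordinal quality rating as numeric, which is a simplification.
# (fit_quality_model is a hypothetical helper; the predictor set is an
# assumption based on the variables discussed in this report.)
fit_quality_model <- function(wines) {
  lm(quality ~ alcohol + residual.sugar + chlorides +
       volatile.acidity + citric.acid,
     data = wines)
}

# Only runs when the CSV is present in the working directory.
if (file.exists("wineQualityWhites.csv")) {
  w_w <- read.csv("wineQualityWhites.csv")
  fit <- fit_quality_model(w_w)
  print(summary(fit))
}
```

The signs of the fitted coefficients can then be compared with the trends seen in the plots, for example a positive coefficient for `alcohol` and a negative one for `chlorides`.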